Data scraping is the use of a computer program to extract information, typically from human-readable websites. We could spend multiple weeks on this topic, so this is only a basic introduction that will allow you to:
HTML elements are written with a start tag, an end tag, and with the content in between:
## {xml_nodeset (1)}
## [1] <h1>Andrew Hoegh </h1>
## [1] "Andrew Hoegh "
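The two outputs above are what selecting an `h1` element and then extracting its text would print. A minimal sketch of those two steps, assuming the rvest package and a hypothetical one-element page written inline (it stands in for the real page, which is not loaded here):

```r
library(rvest)  # also provides the %>% pipe

# Hypothetical one-element page; a real page has far more markup.
page <- read_html("<html><body><h1>Andrew Hoegh </h1></body></html>")

# Select every <h1> element: start tag, content, and end tag together.
heading <- page %>% html_nodes("h1")
print(heading)

# Keep only the content between the start and end tags.
heading_text <- heading %>% html_text()
print(heading_text)
```

`html_nodes()` takes a CSS selector, so the same call works for other tags such as `h2`, `p`, or `li`.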
## [1] "Andrew Hoegh" "\n "
## [3] "Education" "Contact Information"
## [5] "Research Interests" "Teaching"
## [1] " Site Menu expand"
## [2] "Statistics Faculty"
## [3] "Assistant Professor of Statistics"
## [4] "Andrew HoeghMontana State UniversityBozeman, MT 59717"
## [5] "Office: Wilson 2-241Tel: (406) 994-5340andrew.hoegh @ montana.edu"
## [6] "Located in Bozeman, MT"
## [7] "For questions or comments contact the Ask Us Desk."
## [1] " Site Menu expand"
## [2] "Statistics Faculty"
## [3] "Assistant Professor of Statistics"
## [4] "Andrew HoeghMontana State UniversityBozeman, MT 59717"
## [5] "Office: Wilson 2-241Tel: (406) 994-5340andrew.hoegh @ montana.edu"
## [6] "Located in Bozeman, MT"
## [7] "For questions or comments contact the Ask Us Desk."
## [1] "Andrew Hoegh" " "
## [3] "Education" "Contact Information"
## [5] "Research Interests" "Teaching"
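The two heading outputs differ only in the entry that contained a newline, which is what a newline-stripping step would produce. A minimal sketch of that kind of cleanup, assuming rvest and stringr and a hypothetical inline fragment rather than the real page:

```r
library(rvest)
library(stringr)

# Hypothetical fragment with stray newlines in and between headings.
page <- read_html("<body><h2>Andrew Hoegh\n</h2><h2>\n </h2><h2>Education</h2></body>")

raw <- page %>% html_nodes("h2") %>% html_text()

# Remove newline characters; any padding spaces are left in place.
no_newlines <- str_replace_all(raw, pattern = "\n", replacement = "")

# Optionally trim the padding and drop entries that are now empty.
trimmed <- str_trim(no_newlines)
headings <- trimmed[trimmed != ""]
print(headings)
```

The last two lines go one step further than the output above, which still keeps the blank entry.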
## [1] ""
## [2] ""
## [3] ""
## [4] "Search"
## [5] "Skip Navigation"
## [6] "Andrew Hoegh"
## [7] "Teaching"
## [8] "Research Interests"
## [9] "CV"
## [10] "Department of Mathematical Sciences"
## [11] "Andrew Hoegh"
## [12] "Ph.D. (2016) Virginia Tech, Blacksburg, VA"
## [13] "M.S. (2008) Colorado School of Mines, Golden, CO"
## [14] "B.A. (2006) Luther College, Decorah, IA"
## [15] "Phone: (406) - 994-5340"
## [16] "Email: andrew.hoegh @ montana.edu"
## [17] "Bayesian statistics"
## [18] "Statistical Ecology"
## [19] "Spatiotemporal Modeling"
## [20] "Computational Statistics"
## [21] "Sports Analytics"
## [22] "Applied environmental and ecological research"
## [23] "STAT 532 - Bayesian Statistics"
## [24] "STAT 491 - Intro to Bayesian Stats"
## [25] "STAT 446 - Sampling"
## [26] "STAT 436/536 - Time Series"
## [27] "STAT 408 - Statistical Computing and Graphical Analysis"
## [28] "More Information"
## [29] "Admissions"
## [30] "Current Students"
## [31] "Faculty & Staff"
## [32] "Parents & Family"
## [33] "Alumni"
## [34] "Resources"
## [35] "Accessibility"
## [36] "Contact List"
## [37] "Directories"
## [38] "Jobs"
## [39] "Legal & Privacy Policy"
## [40] "Site Index"
## [41] "Follow Us"
## [42] "Facebook Twitter YouTube Instagram LinkedIn"
# Requires rvest and stringr; `andy` is assumed to be the faculty page parsed earlier.
info <- andy %>% html_nodes('li') %>% html_text() %>%
  str_replace_all(pattern = "\n", replacement = "")
info[str_detect(info, '@')]
## [1] "Email: andrew.hoegh @ montana.edu"
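The same filter can be tried without loading the page. A minimal sketch, assuming stringr and a short character vector copied from the list output above; the cleanup of the matched line is a hypothetical extra step, not something the original code does:

```r
library(stringr)

# A few list items copied from the output above.
info <- c("Phone: (406) - 994-5340",
          "Email: andrew.hoegh @ montana.edu",
          "Bayesian statistics")

# Keep only the entries that contain an @ sign.
email_line <- info[str_detect(info, "@")]

# Hypothetical cleanup: drop the label, then the spaces around @.
email <- str_remove(email_line, "^Email:\\s*")
email <- str_replace(email, "\\s*@\\s*", "@")
print(email)
```

`str_detect()` returns a logical vector, so it can index `info` directly, as in the original line.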
river <- read_html("http://www.imdb.com/title/tt0105265/")
title <- river %>% html_nodes('title') %>% html_text()

The movie title is A River Runs Through It (1992) - IMDb.
river <- read_html("http://www.imdb.com/title/tt0105265/")
story.line <- river %>%
  html_nodes('#titleStoryLine') %>%
  html_nodes('p') %>% html_text() %>%
  str_replace_all(pattern = "\n", replacement = "")

The storyline is: The Maclean brothers, Paul and Norman, live a relatively idyllic life in rural Montana, spending much of their time fly fishing. The sons of a minister, the boys eventually part company when Norman moves east to attend college, leaving his rebellious brother to find trouble back home. When Norman finally returns, the siblings resume their fishing outings, and assess both where they’ve been and where they’re going. Written by Jwelch5742.
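The storyline code chains two selectors: an id selector (`#titleStoryLine`) to find one container, then a tag selector (`p`) to find paragraphs inside it. Because the markup of a live page can change, here is a minimal sketch of the same pattern on a hypothetical inline page; the HTML and its text are made up for illustration, and only the selector names come from the code above:

```r
library(rvest)
library(stringr)

# Hypothetical stand-in for a movie page; the id mirrors the selector above.
page <- read_html('
  <div id="titleStoryLine">
    <h2>Storyline</h2>
    <p>Two brothers grow up fly fishing in rural Montana.
</p>
  </div>
  <div id="other"><p>Unrelated paragraph.</p></div>')

story_line <- page %>%
  html_nodes("#titleStoryLine") %>%  # the element with this id
  html_nodes("p") %>%                # only <p> elements inside that element
  html_text() %>%
  str_replace_all(pattern = "\n", replacement = "")
print(story_line)
```

Selecting the container first keeps the unrelated paragraph out of the result, which a bare `html_nodes("p")` would include.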
http://www.montana.edu/marketing/about-msu/
info_table <- read_html("http://www.montana.edu/marketing/about-msu/") %>%
  html_nodes('table') %>%
  html_table()
kable(info_table[[2]])

| 2019 / 2020 | Resident | Nonresident |
|---|---|---|
| Tuition/Fees | $7,320 | $25,850 |
| Room/Board | $10,300 | $10,300 |
| Books/Supplies | $1,450 | $1,450 |
| Total Estimated Cost | $19,070 | $37,600 |
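The dollar amounts in a scraped table arrive as text, so they need cleaning before any arithmetic. A minimal sketch, assuming rvest and stringr and a hypothetical inline table holding two rows copied from the table above; `to_number` is a helper introduced here for illustration:

```r
library(rvest)
library(stringr)

# Hypothetical table holding two rows from the cost table above.
page <- read_html('
  <table>
    <tr><th>2019 / 2020</th><th>Resident</th><th>Nonresident</th></tr>
    <tr><td>Tuition/Fees</td><td>$7,320</td><td>$25,850</td></tr>
    <tr><td>Room/Board</td><td>$10,300</td><td>$10,300</td></tr>
  </table>')

# html_table() on a node set returns a list with one table per <table> element.
costs <- page %>% html_nodes("table") %>% html_table()
costs <- as.data.frame(costs[[1]])

# Remove "$" and "," so the amounts can be treated as numbers.
to_number <- function(x) as.numeric(str_replace_all(x, "[$,]", ""))
costs$Resident <- to_number(costs$Resident)
costs$Nonresident <- to_number(costs$Nonresident)

# Extra cost for nonresidents on each row.
difference <- costs$Nonresident - costs$Resident
print(difference)
```

As in the original code, the result of `html_table()` is indexed with `[[ ]]` to pick a single table out of the list.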